UC Berkeley Library and D-Lab
Fall 2017
It is often said that 80% of data analysis is spent on the process of cleaning and preparing the data. (Dasu and Johnson, 2003)
Provides values for life expectancy, GDP per capita, and population, every five years, from 1952 to 2007.
gapminder <- read.csv("data/gapminder-FiveYearData.csv", stringsAsFactors = TRUE)
head(gapminder)## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
We can use base R functions to calculate summary statistics across groups of observaitons:
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])## [1] 2193.755
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])## [1] 7136.11
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])## [1] 7902.15
See how this might get tedious?
Luckily, the dplyr package provides a number of very useful functions for manipulating dataframes. These functions will save you time by reducing repetition.
install.packages('dplyr')Here we’re going to cover 6 of the most commonly used functions as well as using pipes (%>%) to combine them.
select()filter()group_by()summarize()mutate()arrange()Now let’s load the package:
library(dplyr)Imagine that we just received the gapminder dataset, but are only interested in a few variables in it. We could use the select() function to keep only the variables we select.
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
year_country_gdp <- select(gapminder, year, country, gdpPercap)
head(year_country_gdp)## year country gdpPercap
## 1 1952 Afghanistan 779.4453
## 2 1957 Afghanistan 820.8530
## 3 1962 Afghanistan 853.1007
## 4 1967 Afghanistan 836.1971
## 5 1972 Afghanistan 739.9811
## 6 1977 Afghanistan 786.1134
The pipe operator is one of dplyr’s greatest strengths. It allows you combine several functions in a chain.
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)is equivalent to:
year_country_gdp <- select(gapminder, year, country, gdpPercap)Now let’s say we’re only interested in African countries. We can combine select and filter to select only the observations where continent is Africa.
year_country_gdp_euro <- gapminder %>%
filter(continent == "Africa") %>%
select(year,country,gdpPercap)A common task you’ll encounter when working with data is running calculations on different groups within the data. For instance, what if we wanted to calculated the mean GDP per capita for each continent? In base R, you would have to run the mean() function for each subset of data:
mean(gapminder$gdpPercap[gapminder$continent == "Africa"])## [1] 2193.755
mean(gapminder$gdpPercap[gapminder$continent == "Americas"])## [1] 7136.11
mean(gapminder$gdpPercap[gapminder$continent == "Asia"])## [1] 7902.15
mean(gapminder$gdpPercap[gapminder$continent == "Europe"])## [1] 14469.48
mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])## [1] 18621.61
gapminder$mean.continent.GDP <- NA
gapminder$mean.continent.GDP[gapminder$continent == "Africa"] <- mean(gapminder$gdpPercap[gapminder$continent == "Africa"])
gapminder$mean.continent.GDP[gapminder$continent == "Americas"] <- mean(gapminder$gdpPercap[gapminder$continent == "Americas"])
gapminder$mean.continent.GDP[gapminder$continent == "Asia"] <- mean(gapminder$gdpPercap[gapminder$continent == "Asia"])
gapminder$mean.continent.GDP[gapminder$continent == "Europe"] <- mean(gapminder$gdpPercap[gapminder$continent == "Europe"])
gapminder$mean.continent.GDP[gapminder$continent == "Oceania"] <- mean(gapminder$gdpPercap[gapminder$continent == "Oceania"])gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents)## # A tibble: 5 x 2
## continent mean_gdpPercap
## <fctr> <dbl>
## 1 Africa 2193.755
## 2 Americas 7136.110
## 3 Asia 7902.150
## 4 Europe 14469.476
## 5 Oceania 18621.609
group_by() multiple variables
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
head(gdp_bycontinents_byyear)## # A tibble: 6 x 3
## # Groups: continent [1]
## continent year mean_gdpPercap
## <fctr> <int> <dbl>
## 1 Africa 1952 1252.572
## 2 Africa 1957 1385.236
## 3 Africa 1962 1598.079
## 4 Africa 1967 2050.364
## 5 Africa 1972 2339.616
## 6 Africa 1977 2585.939
group_by() multiple variables and defining multiple variable with summarize()
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
head(gdp_pop_bycontinents_byyear)## # A tibble: 6 x 6
## # Groups: continent [1]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1252.572 982.9521 4570010 6317450
## 2 Africa 1957 1385.236 1134.5089 5093033 7076042
## 3 Africa 1962 1598.079 1461.8392 5702247 7957545
## 4 Africa 1967 2050.364 2847.7176 6447875 8985505
## 5 Africa 1972 2339.616 3286.8539 7305376 10130833
## 6 Africa 1977 2585.939 4142.3987 8328097 11585184
What if we wanted to add these values to our original data frame instead of creating a new object?
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
head(gapminder_with_extra_vars)## # A tibble: 6 x 11
## # Groups: continent, year [6]
## country year pop continent lifeExp gdpPercap
## <fctr> <int> <dbl> <fctr> <dbl> <dbl>
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
## # ... with 5 more variables: mean.continent.GDP <dbl>,
## # mean_gdpPercap <dbl>, sd_gdpPercap <dbl>, mean_pop <dbl>, sd_pop <dbl>
We can use also use mutate() to create new variables prior to (or even after) summarizing information.
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion = gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
head(gdp_pop_bycontinents_byyear)## # A tibble: 6 x 8
## # Groups: continent [1]
## continent year mean_gdpPercap sd_gdpPercap mean_pop sd_pop
## <fctr> <int> <dbl> <dbl> <dbl> <dbl>
## 1 Africa 1952 1252.572 982.9521 4570010 6317450
## 2 Africa 1957 1385.236 1134.5089 5093033 7076042
## 3 Africa 1962 1598.079 1461.8392 5702247 7957545
## 4 Africa 1967 2050.364 2847.7176 6447875 8985505
## 5 Africa 1972 2339.616 3286.8539 7305376 10130833
## 6 Africa 1977 2585.939 4142.3987 8328097 11585184
## # ... with 2 more variables: mean_gdp_billion <dbl>, sd_gdp_billion <dbl>
gapminder_with_extra_vars <- gapminder %>%
group_by(continent, year) %>%
mutate(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop)) %>%
arrange(desc(year), continent)
head(gapminder_with_extra_vars)## # A tibble: 6 x 11
## # Groups: continent, year [1]
## country year pop continent lifeExp gdpPercap
## <fctr> <int> <dbl> <fctr> <dbl> <dbl>
## 1 Algeria 2007 33333216 Africa 72.301 6223.3675
## 2 Angola 2007 12420476 Africa 42.731 4797.2313
## 3 Benin 2007 8078314 Africa 56.728 1441.2849
## 4 Botswana 2007 1639131 Africa 50.728 12569.8518
## 5 Burkina Faso 2007 14326203 Africa 52.295 1217.0330
## 6 Burundi 2007 8390505 Africa 49.580 430.0707
## # ... with 5 more variables: mean.continent.GDP <dbl>,
## # mean_gdpPercap <dbl>, sd_gdpPercap <dbl>, mean_pop <dbl>, sd_pop <dbl>
Use dplyr to create a data frame containing the median lifeExp for each continent
Use dplyr to add a column to the gapminder dataset that contains the total population of the continent of each observation in a given year. For example, if the first observation is Afghanistan in 1952, the new column would contain the population of Asia in 1952.
Use dplyr to: (a) add a column called gdpPercap_diff that contains the difference between the observation’s gdpPercap and the mean gdpPercap of the continent in that year, (b) arrange the dataframe by the column you just created, in descending order (so that the relatively richest country/years are listed first)
Even before we conduct analysis or calculations, we need to put our data into the correct format. The goal here is to rearrange a messy dataset into one that is tidy
The two most important properties of tidy data are:
“Happy families are all alike; every unhappy family is unhappy in its own way.” - Leo Tolstoy
“Tidy datasets are all alike but every messy dataset is messy in its own way.” – Hadley Wickham
wide## name time1 time2 time3
## 1 Wilbur 67 56 70
## 2 Petunia 80 90 67
## 3 Gregory 64 50 101
long## name time heartrate
## 1 Wilbur 1 67
## 2 Petunia 1 80
## 3 Gregory 1 64
## 4 Wilbur 2 56
## 5 Petunia 2 90
## 6 Gregory 2 50
## 7 Wilbur 3 70
## 8 Petunia 3 67
## 9 Gregory 3 10
Which one of these do you think is the tidy format?